Another Two-Level Failure Recovery Scheme: Performance
Author
Abstract
This report deals with the design and evaluation of a "two-level" failure recovery scheme for distributed systems. In our previous work [30, 32], we motivated a "two-level" recovery approach that tolerates the more probable failures with a low overhead, and the less probable failures with a possibly higher overhead. The two-level approach can achieve a smaller overhead than traditional recovery schemes. The contributions of this report are summarized below. We present and evaluate a "two-level" recovery scheme that is suitable for a network of workstations, each workstation having a local disk. The recovery scheme presented in the report can tolerate transient processor failures with a low overhead, while other failures require a larger overhead. The report presents an analysis of the average (expected) task completion time using the proposed scheme. This scheme has been implemented on a workstation cluster. Our analysis indicates that the proposed two-level recovery scheme can achieve better performance than existing "one-level" recovery schemes. The report also evaluates the impact of checkpoint latency on the performance of the recovery scheme. To our knowledge, no analysis of the performance impact of checkpoint latency has been carried out previously. Experimental measurements of checkpoint latency and checkpoint overhead for four applications are presented.
References [32, 30] present material related to this report. The interested reader can obtain these references via anonymous ftp from ftp.cs.tamu.edu:/pub/vaidya.
This report was revised several times in January 1995. The purpose of these revisions was to add Sections 10 and 11, and to revise Section 1.
Similar resources
A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism
The fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another promising method, called algorithmic recovery, works at the algorithm level. These two methods can achieve high efficiency when the system scale is not very large, b...
Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads
Long-running applications are often subject to failures. When failures occur, they lead to unacceptable system overheads. Checkpointing is used to reduce the losses in the event of a failure. In the two-level checkpoint recovery scheme used for long-running tasks, the system unavoidably has to transfer a huge memory context to remote stable storage periodically. Therefo...
A Fast Rollback-Recovery Scheme based on Optimistic Message Logging
This paper presents an efficient rollback recovery scheme based on optimistic message logging. To speed up the recovery process, the rollback point of the failed process is broadcast, and other processes asynchronously make the rollback decision based on the vector time. An asynchronous recovery process usually causes two possible problems: One is the message delivered from an invalid state inter...
An Efficient Rerouting Scheme for MPLS-Based Recovery and Its Performance Evaluation
Path recovery in MPLS is the technique of rerouting traffic around a failure or congestion in an LSP. Currently, there are two models for path recovery: rerouting and protection switching. Existing schemes based on the rerouting model have greater difficulty handling node failures or concurrent node faults. Similarly, the existing schemes based on protection switchi...
A Case for Multi-Level Distributed Recovery Schemes
Most of the distributed recovery schemes proposed in the literature are designed to tolerate an arbitrary number of failures, with a few notable exceptions of schemes designed to tolerate single failures. In this report, we demonstrate that it is often advantageous to use "multi-level" recovery schemes. A "multi-level" recovery scheme is one that can tolerate different number of faults at different...